Sorting is at the core of many database operations, such as index creation,sort-merge joins, and user-requested output sorting. As GPUs are emerging as apromising platform to accelerate various operations, sorting on GPUs becomes aviable endeavour. Over the past few years, several improvements have beenproposed for sorting on GPUs, leading to the first radix sort implementationsthat achieve a sorting rate of over one billion 32-bit keys per second. Yet,state-of-the-art approaches are heavily memory bandwidth-bound, as they requiresubstantially more memory transfers than their CPU-based counterparts. Our work proposes a novel approach that almost halves the amount of memorytransfers and, therefore, considerably lifts the memory bandwidth limitation.Being able to sort two gigabytes of eight-byte records in as little as 50milliseconds, our approach achieves a 2.32-fold improvement over thestate-of-the-art GPU-based radix sort for uniform distributions, sustaining aminimum speed-up of no less than a factor of 1.66 for skewed distributions. To address inputs that either do not reside on the GPU or exceed theavailable device memory, we build on our efficient GPU sorting approach with apipelined heterogeneous sorting algorithm that mitigates the overheadassociated with PCIe data transfers. Comparing the end-to-end sortingperformance to the state-of-the-art CPU-based radix sort running 16 threads,our heterogeneous approach achieves a 2.06-fold and a 1.53-fold improvement forsorting 64 GB key-value pairs with a skewed and a uniform distribution,respectively.
展开▼